Welcome to the Deep Learning Optimization and Deployment using TensorRT lab! In this lab you will learn how to increase the throughput of your models and decrease their latency during the inference stage. We will utilize TF-TRT as a high-performance inference engine optimizer and deployment tool for TensorFlow.
After completing this course, you will be able to:
This lab requires students to have prior knowledge of Deep Learning, in addition to the following topics:
Before we get started, there are a few items to consider about this Jupyter notebook:
The notebook is being rendered in your browser, but its contents are being streamed by an interactive IPython kernel running on an AWS EC2 GPU-enabled instance.
The notebook is composed of cells; cells can contain code which you can run, or they can hold text and/or images which are there for you to read.
You can execute code cells by clicking the Run icon in the menu, or via the following keyboard shortcuts Shift-Enter (run and advance) or Ctrl-Enter (run and stay in the current cell).
To interrupt cell execution, click the Stop button on the toolbar or navigate to the Kernel menu, and select Interrupt.
Run the following code cell and make sure your browser supports the web sockets required to run this lab.
1 + 1 # i'm a code cell -- if you run me I'll perform some computations and show you their result below
This tutorial covers the following topics:
Deep learning models are applied to problems in a variety of domains, including automotive, intelligent video analytics, and live recommendation systems. Most modern deep learning models have millions of parameters that formulate an optimization problem. In addition, live inference engines are required to process multiple data sources simultaneously (e.g. multiple camera inputs), resulting in demand for massive computational power. As the complexity and size of these models grow, the need for model optimization is inevitable.
Another factor motivating a more careful approach to model deployment is data growth. It is estimated that by the year 2020, digital data will exceed 30,000 exabytes, which requires deep learning models to be carefully designed to handle such intense growth. As an example, in image processing applications, not only is the number of cameras growing rapidly, but so is image resolution.

Figure 1. Current commercial state of the art video features resolutions up to 4k, roughly 3840x2160 pixels.
TensorRT is designed to help deploy deep learning for these use-cases. With support for every major framework, TensorRT helps process large amounts of data with low latency through powerful optimizations, use of reduced precision, and efficient memory utilization.

Figure 2. TensorRT comprises two stages: model optimization (left) and the TensorRT target runtime (right)
TensorRT provides an optimization and deployment interface for deep learning frameworks and libraries, enabling users to generate optimized models, increase the efficiency and performance of deep learning models, and decrease the cost of deployment. TensorRT supports importing models from a variety of frameworks such as Caffe2, Chainer, Microsoft Cognitive Toolkit, MXNet, and PyTorch. In addition, TensorFlow and TensorRT are tightly integrated.
There are two phases in the use of TensorRT: optimization and real-time inference. In the optimization phase, TensorRT performs optimizations on the network configuration and generates an optimized plan for computing the forward pass through the deep neural network. The plan is an optimized object code that can be serialized and stored in memory or on disk.
The real-time inference phase takes the form of a long running service or user application that accepts batches of input data, performs inference by executing the plan on the input data and returns batches of output data (classification, object detection, etc.). With TensorRT, you don’t need to install and run a deep learning framework on the deployment hardware. Discussion of the architecture and deployment of the inference service is not a topic of this lab; instead, we will focus on how to use TensorRT for optimizing TensorFlow models.
To start, let's run the following cell to get the TensorRT version running on this machine:
!dpkg -l | grep nvinfer
!ls
While this lab revolves around TF-TRT, TensorRT provides an [ONNX parser](https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/parsers/Onnx/pyOnnx.html) as well, which ingests a saved model, optimizes it, and returns the optimized model for applications using ONNX models. This method of optimization works outside the framework and is not part of this lab.

Figure 3. A common pipeline for ONNX model optimization
In this lab, we focus on TensorRT integration with TensorFlow (known as TF-TRT), where the optimization process happens within the TensorFlow framework. While the ONNX TensorRT optimization requires the whole graph to be supported and optimized, TF-TRT works on subgraphs of the original graph and allows TensorFlow to execute the remaining section(s) of the graph. In practice, TF-TRT optimizes the largest subgraphs possible in the TensorFlow graph; the larger the subgraph, the greater the performance gain from optimizing more layers of the model.
Based on the graph structure, the optimization may happen in different subsections of the graph, and you can define the minimum number of nodes a subgraph must contain to be optimized when calling the API.
TF-TRT is part of the TensorFlow installation binary: when you install TensorFlow-GPU, you will be able to use TensorRT as well.
TF-TRT needs the following to deploy a classification neural network:
A graph definition object (GraphDef), and
Trained weights (e.g. a saved checkpoint file).
In addition, you must define the batch size and specify the output layers. Later, we will see how to choose these parameters for a particular model to create a TensorRT-optimized object.
In this lab, we are going to base our optimized models on the Cityscapes dataset. Cityscapes contains a diverse set of urban street scenes recorded in several German cities. There are 5,000 finely annotated frames, plus another 20K coarsely annotated frames, making the dataset a great case study for segmentation algorithms.
For the first part of this lab, we are going to analyze the frames using classification models. While the objective is overly simplified, the aim is to focus on learning the fundamentals of optimizing models using TF-TRT. Another reason for beginning with classification models is their high level of compatibility compared to detection and segmentation models.
Within the second part of the lab, we focus on more realistic scenarios of multiple object detection within each frame, which require implementing custom operations. The resulting models demonstrate potential applications in smart-city management, automotive and traffic handling. This same optimization methodology could be applied to models specific to other domains.
In this section, we will review how to import a trained model and run test images through a TensorFlow session. Later, we optimize the same model and compare the results. We assume the learner is familiar with Python and TensorFlow, however comments and code descriptions are provided when necessary.
When optimizing a TensorFlow model, TF-TRT can optimize either a subgraph or the entire graph definition. This capability allows the optimization procedure to be applied to the graph where possible and skip the non-supported graph segments. As a result, if the existing model contains a non-supported layer or operation, TensorFlow can still optimize the graph.
Below, you can see a typical workflow of TF-TRT:

Figure 4. TF-TRT optimization workflow
To see the list of operations supported by TF-TRT visit the following link: https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#support-ops
If your model includes operations not listed at the link, the optimized model may still work fine, as TF-TRT is designed to delegate those operations back to TensorFlow.
To begin, we must import the packages required for this lab.
from __future__ import absolute_import
%matplotlib inline
# Importing Matplotlib which is a plotting library for the Python programming
# language and its numerical mathematics extension NumPy
from matplotlib import pyplot as plt
# libraries to read json config files
import argparse
import json
# Helper function for downloading models and datasets
from tensorrt.helper import download_model, download_dataset
from tensorrt.helper import MODELS as models
# urllib2 for http downloads
try:
import urllib2
except ImportError:
import urllib.request as urllib2
# tensorflow libraries
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt
# tqdm timing library
import tqdm
import pdb
# more common python libraries
from collections import namedtuple
from PIL import Image
import numpy as np
import time
import subprocess
import os
import glob
from os.path import join
# OpenCV library
import cv2
# more helper functions for detection tasks
from tensorrt.graph_utils import force_nms_cpu as f_force_nms_cpu
from tensorrt.graph_utils import replace_relu6 as f_replace_relu6
from tensorrt.graph_utils import remove_assert as f_remove_assert
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2, image_resizer_pb2
from object_detection import exporter
# resnet specific configuration parameters
from nets import resnet_v2
# classification task helper files
import shutil
import nets.nets_factory
import tensorflow.contrib.slim as slim
import official.resnet.imagenet_main
from preprocessing import inception_preprocessing, vgg_preprocessing
from classification.tf_trt_models.classification import download_classification_checkpoint, build_classification_graph
from nets import vgg
from tensorflow.python.platform import gfile
from datasets import imagenet
from tensorflow.contrib import slim
# inception specific configuration parameters
from nets import inception
from nets import inception_utils
from preprocessing import inception_preprocessing
import sys
print(sys.path)
import object_detection
print(object_detection.__file__)
import matplotlib
We will be using the TensorFlow Slim library extensively to benchmark the optimization process. Slim is a lightweight library with which you can train and evaluate several widely used deep learning models. It also provides pre-trained checkpoints that can be used and deployed without further ado.
To facilitate loading frozen models and measuring model benchmarks, we have included a helper file, classification_helper.py, for you to review and get familiar with before proceeding to the next step.
To process each image, we need to convert it into a numpy array. Image.open is a convenient function for reading raw images:
def get_image(image_path, width, height):
    """Get the image in tensor form.
    image_path: absolute path of the image
    width: desired width after transformation
    height: desired height after transformation
    Returns a numpy array of the resized image
    """
    return np.array(Image.open(image_path).resize((width, height)))
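As a quick sanity check of this helper, the snippet below synthesizes a small image with PIL instead of using a lab frame (the temporary file and its dimensions are made up for illustration). Note that PIL's resize takes (width, height), while the resulting numpy array is shaped (height, width, channels):

```python
import os
import tempfile

import numpy as np
from PIL import Image

def get_image(image_path, width, height):
    # resize and convert to a numpy array, as in the helper above
    return np.array(Image.open(image_path).resize((width, height)))

# synthesize a small RGB image to stand in for a real frame
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'frame.png')
Image.fromarray(np.zeros((480, 640, 3), dtype=np.uint8)).save(path)

arr = get_image(path, 224, 224)
print(arr.shape)  # (224, 224, 3)
```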
One way to parallelize the computation at the inference stage is to use batch inference.
TensorFlow uses CUDA to optimize GPU memory management, and since memory I/O is typically a bottleneck during inference, it is wise to work with batch sizes that maximize the throughput of your model. Batch sizes that are multiples of 32 usually provide optimal stacks of images for V100 and Tesla T4 GPUs to process, due to the size of the special kernels that TensorRT uses for matrix multiplications.
Note that batch size affects only overall throughput, not other KPIs (like accuracy). When optimizing your neural network model using TensorRT, you need to provide the batch size for which you would like your model optimized. The get_directory_images function is used to stack the image arrays for inference:
def get_directory_images(image_directory,
                         batch_size,
                         image_width,
                         image_height):
    """Get the image batch.
    image_directory: the directory containing the image list
    batch_size: maximum number of images to stack
    image_width: width of the images after pre-processing to match the model input
    image_height: height of the images after pre-processing to match the model input
    Returns the image batch array
    """
    outputs = np.empty([0, image_width, image_height, 3])
    count = 0
    for root, dirs, files in os.walk(image_directory):
        # read up to a specified number of files
        for filename in files:
            # stop once a full batch has been collected
            if count == batch_size:
                break
            file_path = os.path.join(root, filename)
            outputs = np.concatenate([outputs, get_image(file_path, image_width, image_height)[None, ...]])
            count = count + 1
    return outputs
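To see the stacking behavior end to end without the lab's dataset, here is a self-contained sketch that recreates both helpers over a synthesized directory (the file names, image sizes, and batch size are made up for illustration):

```python
import os
import tempfile

import numpy as np
from PIL import Image

def get_image(image_path, width, height):
    return np.array(Image.open(image_path).resize((width, height)))

def get_directory_images(image_directory, batch_size, image_width, image_height):
    # stack up to batch_size images from a directory into one 4-D array
    outputs = np.empty([0, image_width, image_height, 3])
    count = 0
    for root, dirs, files in os.walk(image_directory):
        for filename in files:
            if count == batch_size:
                break
            file_path = os.path.join(root, filename)
            outputs = np.concatenate(
                [outputs, get_image(file_path, image_width, image_height)[None, ...]])
            count += 1
    return outputs

# synthesize three frames, then stack only two of them
tmp_dir = tempfile.mkdtemp()
for i in range(3):
    Image.fromarray(np.full((64, 64, 3), i * 40, dtype=np.uint8)).save(
        os.path.join(tmp_dir, 'img_%d.png' % i))

batch = get_directory_images(tmp_dir, 2, 224, 224)
print(batch.shape)  # (2, 224, 224, 3)
```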
To work with classification models, we need an implementation of the model which describes the layers and operations defining the architecture of the deep network. In addition, we need to initialize the model weights with pre-trained values. One way to save model weights is to take snapshots of the model during training, called checkpoints, and reuse them on later runs. TensorFlow's official and slim repositories contain such snapshots for download.
The download_classification_checkpoint function included in the helper file classification.py is used below to load the checkpoints. Later, we will use the downloaded checkpoint file to retrieve the weights.
You may choose your model from the following classification models:
Let's take a look at how our model performs prior to graph optimization. We use resnet_v2_152 as an example. Later, you will be asked to benchmark other models as an exercise.
First, let's define the parameters describing the model name and paths and also the label file and image paths:
MODEL = 'resnet_v2_152'
CHECKPOINT_PATH = 'resnet_v2_152.ckpt'
#number of classes for a certain dataset
NUM_CLASSES = 1001
LABELS_PATH = 'classification/examples/classification/data/imagenet_labels_%d.txt' % NUM_CLASSES
image_paths = './tensorrt/coco/CS'
Below, we are going to download the checkpoint. Depending on the model you choose, it may take a few minutes for the process to complete.
checkpoint_path = download_classification_checkpoint(MODEL, 'classification/data')
Once the checkpoint is downloaded, we need to obtain the graph itself and initialize the weights using the checkpoint. To accomplish this, we are using another helper function named build_classification_graph defined in classification.py. A frozen graph combines the weights and model architecture to build a complete graph. Note that you can also save the frozen graph on disk using graph's SerializeToString() method.
Note: Some TensorFlow operations may throw WARNING messages, many of which are related to internal functionalities of TensorFlow and slim and may safely be ignored.
frozen_graph, input_names, output_names = build_classification_graph(
    model=MODEL,
    checkpoint=checkpoint_path,
    num_classes=NUM_CLASSES
)
The frozen_graph object contains the serialized GraphDef protocol buffer. Next, we need to extract individual objects of the graph definition into Operation and Tensor objects using TensorFlow's import_graph_def command. Also, note that since we are going to call this function frequently, we need to erase the default graph stack and remove the default graph using reset_default_graph function:
tf.reset_default_graph()
tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True
tf_sess = tf.Session(config=tf_config)
tf.import_graph_def(frozen_graph, name='')
tf_input = tf_sess.graph.get_tensor_by_name(input_names[0] + ':0')
tf_output = tf_sess.graph.get_tensor_by_name(output_names[0] + ':0')
width = int(tf_input.shape.as_list()[1])
height = int(tf_input.shape.as_list()[2])
Let's take a look at the nodes of the loaded graph. Note that depending on the model, the list may become very long. When you are optimizing your model using the create_inference_graph function, you would need to provide the network output names to the function. You can use this code snippet to find the output names.
[n.name for n in tf.get_default_graph().as_graph_def().node]
In order to test our model, we are going to define the run_test function, which takes the batch size and, using the methods we already covered, imports a stack of images and runs inference with the loaded graph. The resulting classes are then sorted by score, and the top 5 are printed.
def run_test(batch_size=32):
    print("Getting list of test images...")
    inputs = get_directory_images(image_paths, batch_size, width, height)
    print("Reading labels...")
    with open(LABELS_PATH, 'r') as f:
        labels = f.readlines()
    print("Running the session over a batch size of: ", batch_size)
    tic = time.time()
    output = tf_sess.run(tf_output, feed_dict={
        tf_input: inputs
    })
    toc = time.time()
    t_diff = toc - tic
    print("TOTAL TIME:", t_diff)
    print("Getting the resulting classes:")
    for index in range(batch_size):
        print(index)
        plt.figure(figsize=(6, 3))
        plt.imshow(inputs[index, ...].astype(np.uint8))
        plt.axis('off')
        plt.show()
        scores = output[index]
        top5_idx = scores.argsort()[::-1][0:5]
        for i in top5_idx:
            print('(%3f) %s' % (scores[i], labels[i]))
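The top-5 selection at the end of run_test can be illustrated in isolation on a dummy score vector (the scores and labels below are made up; argsort sorts ascending, so we reverse it before taking the first five entries):

```python
import numpy as np

# dummy score vector standing in for one row of the model output
scores = np.array([0.05, 0.40, 0.10, 0.25, 0.02, 0.18])
labels = ['cat', 'dog', 'car', 'truck', 'bird', 'bus']

# argsort is ascending, so reverse it and take the first five indices
top5_idx = scores.argsort()[::-1][0:5]
for i in top5_idx:
    print('(%3f) %s' % (scores[i], labels[i]))
# prints dog, truck, bus, car, cat in descending score order
```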
Finally, we are going to call the run_test method over a batch size of 32 to observe how this model performs before optimization.
run_test(32)
Note: Due to operations like memory allocator and GPU initialization, the first run is relatively slow and should not be benchmarked. To get a more accurate comparison, you need to either discard the first run of the inference process or average the time over many runs.
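The discard-the-warmup approach above can be sketched as a small timing harness. This is a hedged sketch: the benchmark helper and its stand-in workload are hypothetical, not part of the lab's helper files; in practice, fn would wrap a tf_sess.run(...) call.

```python
import time

def benchmark(fn, n_runs=10, n_warmup=1):
    """Time fn over n_runs measured runs, discarding n_warmup initial runs."""
    timings = []
    for i in range(n_runs + n_warmup):
        tic = time.time()
        fn()
        toc = time.time()
        if i >= n_warmup:  # skip warmup iterations in the average
            timings.append(toc - tic)
    return sum(timings) / len(timings)

# stand-in workload; in the lab, fn would wrap tf_sess.run(...)
avg = benchmark(lambda: sum(range(100000)), n_runs=5)
print('average time per run: %.6f s' % avg)
```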
Take note of the TOTAL TIME, as we are going to compare it against the post-optimization graph in the next section.
In this section we provide a brief overview of the graph optimization procedure by TensorRT. TensorRT performs several important transformations and optimizations to the neural network graph. First, layers with unused outputs are eliminated to avoid unnecessary computation. Next, where possible convolution, bias, and ReLU layers are fused to form a single layer. Figure 5 shows a typical convolutional network before optimization:

Figure 5. An example convolutional model with multiple convolutional and activation layers before optimization
Figure 6 shows the result of this vertical layer fusion on the original network from Figure 5 (fused layers are labeled CBR in Figure 6). Layer fusion improves the efficiency of running TensorRT-optimized networks on the GPU.

Figure 6. Fusing blocks into a single layer
Another transformation is horizontal layer fusion, or layer aggregation, along with the required division of aggregated layers to their respective outputs, as Figure 7 shows. Horizontal layer fusion improves performance by combining layers that take the same source tensor and apply the same operations with similar parameters, resulting in a single larger layer for higher computational efficiency. The example in Figure 7 shows the combination of 3 1×1 CBR layers from Figure 6 that take the same input into a single larger 1×1 CBR layer. Note that the output of this layer must be disaggregated to feed into the different subsequent layers from the original input graph.

Figure 7. Horizontal layer fusion
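The arithmetic behind horizontal fusion can be checked with a toy numpy example: a 1×1 convolution over C channels is just a matrix multiply, so three layers reading the same input are equivalent to one multiply with the concatenated filter banks, whose output is then split back apart (the shapes below are arbitrary illustrations, not taken from the lab's models):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))  # 8 "pixels", 16 input channels

# three separate 1x1 filter banks applied to the same input
w1, w2, w3 = (rng.standard_normal((16, 4)) for _ in range(3))
y1, y2, y3 = x @ w1, x @ w2, x @ w3

# one fused layer: concatenate the filters, multiply once, split the output
w_fused = np.concatenate([w1, w2, w3], axis=1)
y_fused = x @ w_fused
z1, z2, z3 = np.split(y_fused, 3, axis=1)

print(np.allclose(y1, z1) and np.allclose(y2, z2) and np.allclose(y3, z3))  # True
```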
TensorRT performs its transformations during the build phase transparently to the API user after the TensorRT parser reads in the trained network and configuration file.
In this section, we use the TF-TRT API (create_inference_graph) to optimize the GraphDef object of our model. We are going to review the most important parameters of the function. For a complete list of parameters, please visit https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html.
def create_inference_graph(input_graph_def,
                           outputs,
                           max_batch_size=1,
                           max_workspace_size_bytes=2 << 20,
                           precision_mode="fp32",
                           minimum_segment_size=3,
                           is_dynamic_op=False,
                           maximum_cached_engines=1,
                           cached_engine_batches=[],
                           rewriter_config=None,
                           input_saved_model_dir=None,
                           input_saved_model_tags=None,
                           output_saved_model_dir=None,
                           session_config=None)
where:
input_graph_def: This parameter is the GraphDef object that contains the model to be transformed.
outputs: This parameter lists the output nodes in the graph. Tensors which are not marked as outputs are considered to be transient values that may be optimized away by the builder.
max_batch_size: This parameter is the maximum batch size, i.e. the batch size for which TensorRT will optimize. At runtime, a smaller batch size may be chosen, but a larger batch size is not supported.
max_workspace_size_bytes: TensorRT operators often require temporary workspace. This parameter limits the maximum size that any layer in the network can use. If insufficient scratch is provided, it is possible that TensorRT may not be able to find an implementation for a given layer.
precision_mode: This parameter sets the precision mode; which can be one of fp32, fp16, or int8. Precision lower than FP32, meaning FP16 and INT8, would improve the performance of inference. The FP16 mode uses Tensor Cores or half precision hardware instructions, if possible. The INT8 precision mode uses integer hardware instructions.
minimum_segment_size: This parameter determines the minimum number of TensorFlow nodes required for a TensorRT engine; TensorFlow subgraphs with fewer nodes than this number will not be converted to TensorRT. Therefore, smaller numbers such as 5 are generally preferred. This can also be used to change the minimum number of nodes in the optimized INT8 engines, altering the final optimized graph to fine-tune result accuracy.
Below, we call create_inference_graph to optimize the graph.
# you may choose FP16 and FP32 precision modes.
# INT8 involves additional steps that will be discussed later in Exercise 3.
p_mode = 'FP16'
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=output_names,
    max_batch_size=32,
    max_workspace_size_bytes=1 << 25,
    precision_mode=p_mode,
    minimum_segment_size=50
)
We need to reset the graph again prior to importing the optimized one:
tf.reset_default_graph()
tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True
tf_sess = tf.Session(config=tf_config)
tf.import_graph_def(trt_graph, name='')
tf_input = tf_sess.graph.get_tensor_by_name(input_names[0] + ':0')
tf_output = tf_sess.graph.get_tensor_by_name(output_names[0] + ':0')
width = int(tf_input.shape.as_list()[1])
height = int(tf_input.shape.as_list()[2])
With the new graph replaced, let's take a look at the active graph's node structure:
[n.name for n in tf.get_default_graph().as_graph_def().node]
Depending on the model, the output should vary from what we saw before and may look like:
['input',
'sub/y',
'mul/x',
'resnet_v2_152/Pad/paddings',
'ConstantFolding/truediv_recip',
'PermConstNHWCToNCHW-LayoutOptimizer',
'truediv',
'sub',
'mul',
'resnet_v2_152/Pad',
'resnet_v2_152/conv1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer',
'resnet_v2_152/my_trt_op_0',
'resnet_v2_152/SpatialSqueeze',
'scores']
As you can see, some subgraphs have been replaced with TRT ops (e.g. my_trt_op_0), and TensorFlow's layout optimizer has inserted nodes suffixed with LayoutOptimizer. Now let's run the test again and benchmark the new model's performance:
run_test(32)
After ignoring the first run, with the resnet_v2_152 model you should see a performance improvement of ~400% for a batch size of 32, and the main resulting classes remain the same as before optimization. Remember that the image aspect ratios are not maintained in this implementation, which impacts overall classification quality (cropping the image is a potential solution in a real-world scenario); what matters here is that the highest-probability classification results are reproduced by the optimized model. Later, in Exercise 2, you will see improved results by utilizing the VGG 19 model.
One of the important parameters of the optimization phase is max_batch_size. While you may choose a different batch size to stack your images during inference, the best performance is achieved when the batch size exactly matches max_batch_size. A smaller batch size may lead to poor performance and might significantly decrease the efficiency of your model, while larger batch sizes are not supported by TensorRT. In this exercise, we are going to examine this parameter.
Compare the inference time for 8, 16 and 32 batch sizes and their combinations with FP16 and FP32 precision modes.
# PERFORM THE COMPARISON HERE (total inference time):
Batch size \ Precision |   FP16   |   FP32
-----------------------+----------+----------
          8            |          |
         16            |          |
         32            |          |
The success of a TRT optimization task also depends on the architecture of the model: the more supported layers a model comprises, the more TRT layers are generated and, consequently, the higher the performance achieved. The minimum_segment_size parameter determines when to generate TRT layers based on the number of consecutive supported layers. In this exercise, you are asked to optimize the vgg_19 and inception_v4 models, varying minimum_segment_size to maximize throughput (in FP16 mode only). You can achieve this by setting the MODEL and CHECKPOINT_PATH parameters together with changing create_inference_graph's minimum_segment_size parameter:
# PERFORM THE COMPARISON HERE (total inference time - FP16 only):
Model \ Segment size |    1     |    5
---------------------+----------+----------
vgg_19               |          |
inception_v4         |          |
Typically, model training is performed using 32-bit floating-point mathematics. Due to the backpropagation algorithm and weight updates, this high precision is necessary to allow for model convergence. Once trained, inference can be done in reduced precision (e.g. FP16), as inference only requires a feed-forward pass through the network. Reducing numerical precision allows for a smaller model with faster inference time, lower memory requirements, and higher throughput.
Moreover, the NVIDIA Pascal and Volta GPUs are capable of executing 8-bit integer 4-element vector dot product instructions to accelerate deep neural network inference (see Figure 9).

Figure 9. The DP4A instruction: 4-element dot product with accumulation.
While this new instruction provides faster computation, there is a significant challenge in representing weights and activations of deep neural networks in this reduced INT8 format. As Table 1 shows, the dynamic range and granularity of representable values for INT8 is significantly limited compared to FP32 or FP16.

Table 1. Dynamic range of FP32, FP16 and INT8.
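To make the limited INT8 range concrete, the following sketch applies naive symmetric linear quantization with a max-calibrated scale. This is purely illustrative: the quantize_int8 helper and the max-based scale are assumptions for this example, and TensorRT's calibration chooses the dynamic range far more carefully.

```python
import numpy as np

def quantize_int8(x, scale):
    # symmetric linear quantization: real_value ~= scale * int8_value
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

x = np.array([-5.0, -0.01, 0.0, 0.5, 2.5, 6.0], dtype=np.float32)
scale = np.abs(x).max() / 127.0  # naive max calibration
q = quantize_int8(x, scale)
x_hat = q.astype(np.float32) * scale  # dequantize to see the approximation

print(q)
print(np.abs(x - x_hat).max())  # rounding error is bounded by ~scale/2
```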
By using trt.calib_graph_to_infer_graph function, you can quickly convert a model trained using FP32 into INT8 for deployment with negligible accuracy loss. However, the main challenge is to find the correct dynamic range of the inputs. TensorRT uses a calibration process that minimizes the information loss when approximating the FP32 network with a limited 8-bit integer representation.
When preparing the calibration dataset, you should capture the expected distribution of data in typical inference scenarios. You need to make sure that the calibration dataset covers all the expected scenarios; for example, clear weather, rainy day, night scenes, etc. When examining your own dataset, you should create a separate calibration dataset. The calibration dataset shouldn’t overlap with the training, validation or test datasets.
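A minimal sketch of carving out such a disjoint calibration split from a list of frames (the frame names and split sizes below are made up for illustration):

```python
import random

# hypothetical file list standing in for a real dataset
all_frames = ['frame_%04d.png' % i for i in range(1000)]

random.seed(42)
random.shuffle(all_frames)

# carve out a calibration set disjoint from train/val/test
calib = set(all_frames[:100])
train = set(all_frames[100:800])
val = set(all_frames[800:900])
test = set(all_frames[900:])

print(calib.isdisjoint(train | val | test))  # True
```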
For more information on INT8 inference, you may watch the 8-Bit inference using TensorRT presentation.
So far, we have seen how replacing full precision mode with FP16 affects throughput of the model. In this exercise, we are going to have a dry run on INT8 inference.
As mentioned above, INT8 inference mode includes one additional calibration step: calling the trt.calib_graph_to_infer_graph function on our optimized model. However, running this function right after create_inference_graph would result in:
FailedPreconditionError: Need to run graph with calibration data first!
To remedy the issue, you need to run the tf.Session with calibration data first (simply calling the run_test method in our case). The choice of calibration data is crucial, as it should reflect the overall distribution of your image data. Consider providing different weather, environmental, and lighting conditions when collecting the calibration data. Once the calibration data is collected, you need to run the inference and finally obtain the optimized model by calling the calib_graph_to_infer_graph function. Below, you can see a summary of the steps required to create an INT8 inference model:
Call the create_inference_graph method with INT8 precision mode:
p_mode = 'INT8'
trt_graph = trt.create_inference_graph(
    ...
)
Reset the graph using reset_default_graph and import the new trt_graph using import_graph_def. Look at the code examples above for hints.
Update tf_input and tf_output with the new model inputs and outputs.
Call the run_test function to run the model on the “calibration data”. Note that for brevity, we are using the same test image set for calibration. However, this is not a realistic assumption, and, in your application, you must use an inclusive dataset representing a wide range of potential data.
Get the calibrated graph using the following command:
calibrated_model = trt.calib_graph_to_infer_graph(trt_graph)
Reset the graph once again, and this time import the new calibrated_model
Update tf_input and tf_output with the new model inputs and outputs.
Re-run the test. Note that the first run will take longer than normal again.
# YOUR CODE GOES HERE:
If you get stuck, the solution is provided here
While the throughput improvement of the classification model was significant, it could not serve a real-world application due to the lack of useful class information within frames. Smart cities require more refined object classification than the overall image class labels. We are often interested in finding more objects within the image together with their exact locations. To achieve this goal, we need models that perform bounding-box detection or segmentation. These models possess a more diverse set of operations, and while many of these TF layers and operations are supported by TensorRT, from time to time you would need to implement custom layers that are not supported in TRT.
In such cases, TensorRT functionality can be extended to implement the specific output or layer. Custom layers, often referred to as plugins, are implemented and instantiated by an application, and their lifetime must span their use within a TensorRT engine. Below, we are going to see how a custom operation for the Relu6 activation function can be implemented in Python, followed by the output of the next optimization task, so you can compare the overall detection time before and after optimization.

Figure 10. Object detection before optimization (left) and after optimization (right)
The Relu6 activation function was first introduced in Convolutional Deep Belief Networks on CIFAR-10. The purpose of the Relu6 activation function is to put a limit on the upper bound of the output value, which requires fewer bits for the integer part 'Q' of a fixed-point Q.f representation, leaving more bits for the fractional part 'f'.
We can formulate Relu6 by the following formula:
Relu6(x) = min(max(x, 0), 6) = Relu(x) - Relu(x - 6)
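A quick numpy check confirms that the two forms of the identity agree:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

x = np.linspace(-10, 10, 101)
relu6_direct = np.minimum(np.maximum(x, 0), 6)   # min(max(x, 0), 6)
relu6_composed = relu(x) - relu(x - 6)           # Relu(x) - Relu(x - 6)

print(np.allclose(relu6_direct, relu6_composed))  # True
```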
Relu6 is not yet supported in TensorRT, so we need to implement a custom function that replaces Relu6 with a Relu-based implementation. The operation would look like the following TensorFlow snippet:
graph = tf.Graph()
with graph.as_default():
    for node in graph_def.node:
        if node.op == 'Relu6':
            input_name = node.input[0]
            output_name = node.name
            const6_name = 'const6'
            tf_x = tf.placeholder(tf.float32, [10, 10], name=input_name)
            tf_6 = tf.constant(dtype=tf.float32, value=6.0, name=const6_name)
            with tf.name_scope(output_name):
                tf_y1 = tf.nn.relu(tf_x, name='relu1')
                tf_y2 = tf.nn.relu(tf.subtract(tf_x, tf_6, name='sub1'), name='relu2')
            # Relu6(x) = Relu(x) - Relu(x - 6)
            tf_y = tf.subtract(tf_y1, tf_y2, name=output_name)
    graph_def = graph.as_graph_def()
    graph_def.node[-1].name = output_name
    # remove the placeholder and constant nodes that stood in for the real inputs
    for node in graph_def.node:
        if node.name == input_name:
            graph_def.node.remove(node)
    for node in graph_def.node:
        if node.name == const6_name:
            graph_def.node.remove(node)
We simply traverse the GraphDef structure through the for node in graph_def.node: loop and replace every Relu6 node with a modified version of Relu. This is a simple implementation of a custom activation function. To exercise our Relu6 replacement, we will optimize the ssdlite_mobilenet_v2 model, which is trained on the COCO dataset. You can review the configuration of the optimization plan defined in model_config.json.
Let's take a look at the model optimization configuration file:
config_path = join('tensorrt', 'model_config.json')
with open(config_path, 'r') as f:
    test_config = json.load(f)
print(json.dumps(test_config, sort_keys=True, indent=4))
Model = namedtuple('Model', ['name', 'url', 'extract_dir'])
INPUT_NAME = 'image_tensor'
BOXES_NAME = 'detection_boxes'
CLASSES_NAME = 'detection_classes'
SCORES_NAME = 'detection_scores'
MASKS_NAME = 'detection_masks'
NUM_DETECTIONS_NAME = 'num_detections'
FROZEN_GRAPH_NAME = 'frozen_inference_graph.pb'
PIPELINE_CONFIG_NAME = 'pipeline.config'
CHECKPOINT_PREFIX = 'model.ckpt'
First, we need to download the checkpoint for our model:
config_path, checkpoint_path = download_model(**test_config['source_model'])
print(config_path, checkpoint_path)
You may notice that in addition to Relu6, we have defined other customization operations, e.g. overriding nms_score_threshold (the non-max suppression score threshold) with a constant value of 0.3, which discards low-confidence proposals more aggressively than the default (for more info, see [here](https://www.tensorflow.org/api_docs/python/tf/image/non_max_suppression)).
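To make the effect of a score threshold concrete, here is a minimal NumPy sketch of the filtering step that runs before non-max suppression (the function name `filter_by_score` is ours for illustration, not part of the object detection API):

```python
import numpy as np

def filter_by_score(boxes, scores, score_threshold=0.3):
    """Keep only proposals whose confidence meets the threshold.

    Low-confidence boxes are discarded before non-max suppression
    even runs, which also shrinks the NMS workload.
    """
    keep = scores >= score_threshold
    return boxes[keep], scores[keep]

boxes = np.array([[0.1, 0.1, 0.5, 0.5],
                  [0.2, 0.2, 0.6, 0.6],
                  [0.0, 0.0, 0.9, 0.9]])
scores = np.array([0.95, 0.25, 0.60])
kept_boxes, kept_scores = filter_by_score(boxes, scores, 0.3)
print(len(kept_boxes))  # 2 boxes survive the 0.3 cutoff
```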
Once the GraphDef is modified to replace the non-supported operations, we perform the TRT optimization process.
Below, you can find the optimization function which is very similar to the optimize_graph defined above:
def optimize_model(config_path,
                   checkpoint_path,
                   use_trt=True,
                   force_nms_cpu=True,
                   replace_relu6=True,
                   remove_assert=True,
                   override_nms_score_threshold=None,
                   override_resizer_shape=None,
                   max_batch_size=1,
                   precision_mode='FP32',
                   minimum_segment_size=50,
                   max_workspace_size_bytes=1 << 25,
                   calib_images_dir=None,
                   num_calib_images=None,
                   calib_image_shape=None,
                   tmp_dir='.optimize_model_tmp_dir',
                   remove_tmp_dir=True,
                   output_path=None):
    """Optimizes an object detection model using TensorRT.

    Optimizes an object detection model using TensorRT. This method also
    performs pre-TensorRT optimizations specific to the TensorFlow object
    detection API models. Please see the list of arguments for other
    optimization parameters.

    Args
    ----
    config_path: A string representing the path of the object detection
        pipeline config file.
    checkpoint_path: A string representing the path of the object
        detection model checkpoint.
    use_trt: A boolean representing whether to optimize with TensorRT. If
        False, regular TensorFlow will be used, but other optimizations
        (like NMS device placement) will still be applied.
    force_nms_cpu: A boolean indicating whether to place NMS operations on
        the CPU.
    replace_relu6: A boolean indicating whether to replace relu6(x)
        operations with relu(x) - relu(x - 6).
    remove_assert: A boolean indicating whether to remove Assert
        operations from the graph.
    override_nms_score_threshold: An optional float representing an NMS
        score threshold to override the one specified in the object
        detection configuration file.
    override_resizer_shape: An optional list/tuple of integers
        representing a fixed shape to override the default image resizer
        specified in the object detection configuration file.
    max_batch_size: An integer representing the max batch size to use for
        TensorRT optimization.
    precision_mode: A string representing the precision mode to use for
        TensorRT optimization. Must be one of 'FP32', 'FP16', or 'INT8'.
    minimum_segment_size: An integer representing the minimum segment size
        to use for TensorRT graph segmentation.
    max_workspace_size_bytes: An integer representing the max workspace
        size for TensorRT optimization.
    calib_images_dir: A string representing a directory containing images
        to use for int8 calibration.
    num_calib_images: An integer representing the number of calibration
        images to use. If None, will use all images in the directory.
    calib_image_shape: A tuple of integers representing the (height,
        width) that images will be resized to for calibration.
    tmp_dir: A string representing a directory for temporary files. This
        directory will be created and removed by this function and should
        not already exist. If the directory exists, an error will be
        thrown.
    remove_tmp_dir: A boolean indicating whether we should remove the
        tmp_dir or throw an error.
    output_path: An optional string representing the path to save the
        optimized GraphDef to.

    Returns
    -------
    A GraphDef representing the optimized model.
    """
    if os.path.exists(tmp_dir):
        if not remove_tmp_dir:
            raise RuntimeError(
                'Cannot create temporary directory, path exists: %s' % tmp_dir)
        subprocess.call(['rm', '-rf', tmp_dir])

    # load config from file
    config = pipeline_pb2.TrainEvalPipelineConfig()
    with open(config_path, 'r') as f:
        text_format.Merge(f.read(), config, allow_unknown_extension=True)

    # override some config parameters
    if config.model.HasField('ssd'):
        config.model.ssd.feature_extractor.override_base_feature_extractor_hyperparams = True
        if override_nms_score_threshold is not None:
            config.model.ssd.post_processing.batch_non_max_suppression.score_threshold = override_nms_score_threshold
        if override_resizer_shape is not None:
            config.model.ssd.image_resizer.fixed_shape_resizer.height = override_resizer_shape[0]
            config.model.ssd.image_resizer.fixed_shape_resizer.width = override_resizer_shape[1]
    elif config.model.HasField('faster_rcnn'):
        if override_nms_score_threshold is not None:
            config.model.faster_rcnn.second_stage_post_processing.score_threshold = override_nms_score_threshold
        if override_resizer_shape is not None:
            config.model.faster_rcnn.image_resizer.fixed_shape_resizer.height = override_resizer_shape[0]
            config.model.faster_rcnn.image_resizer.fixed_shape_resizer.width = override_resizer_shape[1]

    tf_config = tf.ConfigProto()
    tf_config.gpu_options.allow_growth = True

    # export inference graph to file (initial); this will create tmp_dir
    with tf.Session(config=tf_config):
        with tf.Graph().as_default():
            exporter.export_inference_graph(
                INPUT_NAME,
                config,
                checkpoint_path,
                tmp_dir,
                input_shape=[max_batch_size, None, None, 3])

    # read frozen graph from file
    frozen_graph_path = os.path.join(tmp_dir, FROZEN_GRAPH_NAME)
    frozen_graph = tf.GraphDef()
    with open(frozen_graph_path, 'rb') as f:
        frozen_graph.ParseFromString(f.read())

    # apply graph modifications
    if force_nms_cpu:
        frozen_graph = f_force_nms_cpu(frozen_graph)
    if replace_relu6:
        frozen_graph = f_replace_relu6(frozen_graph)
    if remove_assert:
        frozen_graph = f_remove_assert(frozen_graph)

    # get output names
    output_names = [BOXES_NAME, CLASSES_NAME, SCORES_NAME, NUM_DETECTIONS_NAME]

    # optionally perform TensorRT optimization
    if use_trt:
        with tf.Graph().as_default() as tf_graph:
            with tf.Session(config=tf_config) as tf_sess:
                frozen_graph = trt.create_inference_graph(
                    input_graph_def=frozen_graph,
                    outputs=output_names,
                    max_batch_size=max_batch_size,
                    max_workspace_size_bytes=max_workspace_size_bytes,
                    precision_mode=precision_mode,
                    minimum_segment_size=minimum_segment_size)
                # perform calibration for int8 precision
                if precision_mode == 'INT8':
                    if calib_images_dir is None:
                        raise ValueError(
                            'calib_images_dir must be provided for int8 optimization.')
                    tf.import_graph_def(frozen_graph, name='')
                    tf_input = tf_graph.get_tensor_by_name(INPUT_NAME + ':0')
                    tf_boxes = tf_graph.get_tensor_by_name(BOXES_NAME + ':0')
                    tf_classes = tf_graph.get_tensor_by_name(CLASSES_NAME + ':0')
                    tf_scores = tf_graph.get_tensor_by_name(SCORES_NAME + ':0')
                    tf_num_detections = tf_graph.get_tensor_by_name(
                        NUM_DETECTIONS_NAME + ':0')
                    image_paths = glob.glob(os.path.join(calib_images_dir, '*.jpg'))
                    image_paths = image_paths[0:num_calib_images]
                    for image_idx in tqdm.tqdm(range(0, len(image_paths), max_batch_size)):
                        # read batch of images
                        batch_images = []
                        for image_path in image_paths[image_idx:image_idx + max_batch_size]:
                            image = _read_image(image_path, calib_image_shape)
                            batch_images.append(image)
                        # execute batch of images
                        boxes, classes, scores, num_detections = tf_sess.run(
                            [tf_boxes, tf_classes, tf_scores, tf_num_detections],
                            feed_dict={tf_input: batch_images})
                    frozen_graph = trt.calib_graph_to_infer_graph(frozen_graph)

    # re-enable variable batch size; this was forced to max
    # batch size during export to enable TensorRT optimization
    for node in frozen_graph.node:
        if INPUT_NAME == node.name:
            node.attr['shape'].shape.dim[0].size = -1

    # write optimized model to disk
    if output_path is not None:
        with open(output_path, 'wb') as f:
            f.write(frozen_graph.SerializeToString())

    # remove temporary directory
    subprocess.call(['rm', '-rf', tmp_dir])
    return frozen_graph
To evaluate the model, we are going to create a detect_frames function, which runs the detection model over a set of images and measures the performance of the resulting detections.
In the function below, we load the graph, create a session and loop through the feedforward function. The algorithm also returns scores, bounding-box locations and classes for a predefined number of proposals, which we can alter. The function overlays these bounding-box proposals (green) onto each frame.
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as vis_util

def detect_frames(path_to_labels,
                  data_folder,
                  output_path):
    # Load the label map and access category names and their associated indices
    label_map = label_map_util.load_labelmap(path_to_labels)
    categories = label_map_util.convert_label_map_to_categories(
        label_map, max_num_classes=1, use_display_name=True)
    category_index = label_map_util.create_category_index(categories)
    print('Starting session...')
    with detection_graph.as_default():
        with tf.Session(graph=detection_graph) as sess:
            # Define input and output Tensors for detection_graph
            image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
            # Each box represents a part of the image where a particular object was detected.
            detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
            # Each score represents the level of confidence for each of the objects.
            # The score is shown on the resulting image, together with the class label.
            detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')
            detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')
            num_detections = detection_graph.get_tensor_by_name('num_detections:0')
            frames_path = data_folder
            xml_path = join(data_folder, 'xml')
            num_frames = len([name for name in os.listdir(frames_path)
                              if os.path.isfile(join(frames_path, name))])
            reference_image = os.listdir(data_folder)[0]
            image = cv2.imread(join(data_folder, reference_image))
            height, width, channels = image.shape
            number_of_tests = 10
            counter = 1
            total_time = 0
            print('Running Inference:')
            for fdx, file_name in enumerate(sorted(os.listdir(data_folder))):
                image = cv2.imread(join(frames_path, file_name))
                image_np = np.array(image)
                # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
                image_np_expanded = np.expand_dims(image_np, axis=0)
                tic = time.time()
                (boxes, scores, classes, num) = sess.run(
                    [detection_boxes, detection_scores, detection_classes, num_detections],
                    feed_dict={image_tensor: image_np_expanded})
                toc = time.time()
                t_diff = toc - tic
                total_time = total_time + t_diff
                # Visualize the results of the detection.
                vis_util.visualize_boxes_and_labels_on_image_array(
                    image,
                    np.squeeze(boxes),
                    np.squeeze(classes).astype(np.int32),
                    np.squeeze(scores),
                    category_index,
                    use_normalized_coordinates=True,
                    line_thickness=7,
                    min_score_thresh=0.5)
                cv2.imwrite(join(output_path, file_name), image)
                plt.figure(figsize=(6, 3))
                plt.imshow(image.astype(np.uint8))
                plt.axis('off')
                plt.show()
                prog = 'Completed current frame in: %.3f seconds. (Total: %.3f seconds)' % (t_diff, total_time)
                print('{}\r'.format(prog))
                counter = counter + 1
                if counter > number_of_tests:
                    break
Now let's test the model before optimization. First, we need to load the graph:
path_to_graph = join('models', models[test_config['source_model']['model_name']].extract_dir,
                     'frozen_inference_graph.pb')
# Import a graph by reading it as a string, parsing the string, then importing it with tf.import_graph_def
print('Importing graph...')
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(path_to_graph, 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')
print('Importing graph completed')
Next, we test the function by providing the labels and the directory containing test images, prior to optimizing the graph.
PATH_TO_LABELS = 'tensorrt/coco/mscoco_label_map.pbtxt'
PATH_TO_TEST_IMAGES_DIR = 'tensorrt/coco/CS' #Change the dataset and view the detections
OUT_PATH = 'temp'
detect_frames(PATH_TO_LABELS, PATH_TO_TEST_IMAGES_DIR, OUT_PATH)
Note the time spent on the inference. Also notice that here we are feeding images individually. The for loop within the detect_frames function reads one image at a time and passes it through the inference session. In Exercise 5, you are asked to run the inference with a batch size of 32 and compare the results with single-image inference.
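One way to approach batched feeding is to group frame paths into fixed-size chunks and stack each chunk into a single tensor before the session run. Below is a minimal sketch (the names `batch_iterator` and `load_image` are ours, not part of the lab code, and it assumes all frames share the same shape):

```python
def batch_iterator(image_paths, batch_size=32):
    """Yield lists of paths in groups of batch_size (the last group may be smaller)."""
    for i in range(0, len(image_paths), batch_size):
        yield image_paths[i:i + batch_size]

# With a stacked feed, a single sess.run call processes the whole batch:
#   batch = np.stack([load_image(p) for p in chunk])   # shape [B, H, W, 3]
#   sess.run([detection_boxes, ...], feed_dict={image_tensor: batch})
paths = ['frame_%03d.jpg' % i for i in range(70)]
sizes = [len(chunk) for chunk in batch_iterator(paths, 32)]
print(sizes)  # [32, 32, 6]
```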
Now we are moving to the optimization step:
# optimize model using source model
frozen_graph_optimized = optimize_model(
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    **test_config['optimization_config'])
Finally, we import the optimized GraphDef returned by optimize_model into a fresh graph and run the detect frames procedure:
print('Importing graph...')
detection_graph = tf.Graph()
with detection_graph.as_default():
    # Import the in-memory optimized GraphDef directly
    tf.import_graph_def(frozen_graph_optimized, name='')
print('Importing graph completed')
PATH_TO_LABELS = 'tensorrt/coco/mscoco_label_map.pbtxt'
PATH_TO_TEST_IMAGES_DIR = 'tensorrt/coco/CS' #Change the dataset and view the detections
OUT_PATH = 'temp'
detect_frames(PATH_TO_LABELS, PATH_TO_TEST_IMAGES_DIR, OUT_PATH)
How do the results of the optimized model compare to the non-optimized version? Can you distinguish any visual difference?
Repeat the benchmarking for the ssd_mobilenet_v1_ppn_coco model and compare the speed-up gains. To do that, you need to open the model_config.json file and modify the model_name, image_shape and output_path parameters. Afterwards, follow the same steps as for the ssdlite_mobilenet_v2_coco model and compare the results.
Remember to save the file from the menu by clicking on the save button.
Your answer:
Previously, within the classification task, we saw how choosing a proper batch size could affect the classification time. In our detection example, we simulated a case where the data feed is live and sourced from a single camera (we processed one image at a time during each run of the tf.Session object). While the time improvement was again significant, we can do better if we optimize for, and then infer, images in batches. Consider a case where you have multiple cameras, and at each step several frames are queued and fed into your inference engine.
In this exercise, modify the optimization for batch sizes of 16 and 32, then modify detect_frames to perform inference in batches of the same size, and compare the inference time improvement over single-frame inference.
# PERFORM THE COMPARISON HERE (average inference time):
Batch \ Precision        |  FP16  |  FP32
-------------------------|--------|--------
1 (summed over 32 runs)  |        |
32 (single run)          |        |
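One way to fill in the table consistently is a small timing helper (a sketch; `run_batch` is a hypothetical callable wrapping sess.run with a fixed feed_dict, not part of the lab code):

```python
import time

def average_inference_time(run_batch, batch, n_runs=32):
    """Time n_runs calls of run_batch(batch); return mean seconds per call.

    A warm-up call is issued first so that one-time costs (such as
    TensorRT engine build on the first run) do not skew the measurement.
    """
    run_batch(batch)  # warm-up, excluded from timing
    tic = time.time()
    for _ in range(n_runs):
        run_batch(batch)
    return (time.time() - tic) / n_runs
```

For example, `average_inference_time(lambda b: sess.run(outputs, feed_dict={image_tensor: b}), batch)` would give the mean per-batch latency to enter in one table cell.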
Congratulations on completing the Deep Learning Optimization and Deployment using TensorRT lab! If you have any spare time left, please alter the script above to experiment with models from the following list:
ssd_mobilenet_v1_0p75_depth_quantized_coco
ssd_mobilenet_v1_ppn_coco
ssd_mobilenet_v2_coco
ssdlite_mobilenet_v2_coco
ssd_inception_v2_coco
ssd_resnet_50_fpn_coco
faster_rcnn_resnet50_coco
faster_rcnn_nas
mask_rcnn_resnet50_atrous_coco
facessd_mobilenet_v2_quantized_open_image_v4
What we have learned:
We hope to see you in future courses where you can learn more about usage of the optimized models in real-life applications like Intelligent Video Analytics, Natural Language Processing and Healthcare.
In the following blog post, you will learn how to deploy a deep learning application onto a GPU, increasing throughput and reducing latency during inference:
The next blog post shows how TensorRT integration speeds up TensorFlow inference:
The following blog post introduces TensorRT as a high-performance deep learning inference library for production environments:
Official documentation on how to accelerate inference in TensorFlow with TensorRT (TF-TRT):
TensorRT Developer Guide:
The following video demonstrates how to configure a simple Recurrent Neural Network (RNN) based on the character-level language model using NVIDIA TensorRT:
p_mode = 'INT8'
trt_graph = trt.create_inference_graph(
input_graph_def=frozen_graph,
outputs=output_names,
max_batch_size=32,
max_workspace_size_bytes=1 << 25,
precision_mode=p_mode,
minimum_segment_size=50)
tf.reset_default_graph()
tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True
tf_sess = tf.Session(config=tf_config)
tf.import_graph_def(trt_graph, name='')
tf_input = tf_sess.graph.get_tensor_by_name(input_names[0] + ':0')
tf_output = tf_sess.graph.get_tensor_by_name(output_names[0] + ':0')
run_test(32)
calibrated_model = trt.calib_graph_to_infer_graph(trt_graph)
tf.reset_default_graph()
tf_sess = tf.Session(config=tf_config)
tf.import_graph_def(calibrated_model, name='')
tf_input = tf_sess.graph.get_tensor_by_name(input_names[0] + ':0')
tf_output = tf_sess.graph.get_tensor_by_name(output_names[0] + ':0')
run_test(32)  # SLOW: the first INT8 run triggers engine build
run_test(32)  # FAST: subsequent runs reuse the built engines
Click here to go back